{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 入门云原生AI - 提交Bert训练任务\n",
    "BERT是Google开发的一种nlp领域的预训练语言表示模型，BERT在11项NLP任务中夺得非常好的结果，Google在11月份开源了bert的代码，同时发布了多种语言版本的模型。我们可以通过arena 提交bert模型的训练代码，非常方便地利用这项学术红利。\n",
    "\n",
    "在这个示例中，我们将演示：\n",
    "* 利用Arena提交Bert的pretraining训练任务，并且查看训练任务状态和日志。\n",
    "\n",
    "> 前提：请先完成文档中的[共享存储配置](../docs/setup/SETUP_NAS.md)，当前${HOME}就是其中`training-data`的数据卷对应目录。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.下载Bert样例源代码到${HOME}/models目录"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into '/root/models/bert'...\n",
      "remote: Enumerating objects: 317, done.\u001b[K\n",
      "remote: Total 317 (delta 0), reused 0 (delta 0), pack-reused 317\u001b[K\n",
      "Receiving objects: 100% (317/317), 254.03 KiB | 149.00 KiB/s, done.\n",
      "Resolving deltas: 100% (178/178), done.\n",
      "Checking connectivity... done.\n"
     ]
    }
   ],
   "source": [
    "! git clone \"https://github.com/google-research/bert.git\" \"${HOME}/models/bert\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2.下载pretraining 任务所需要的语料数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  388M  100  388M    0     0  12.6M      0  0:00:30  0:00:30 --:--:-- 12.2M\n",
      "Archive:  uncased_L-12_H-768_A-12.zip\n",
      "   creating: uncased_L-12_H-768_A-12/\n",
      "  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  \n",
      "  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  \n",
      "  inflating: uncased_L-12_H-768_A-12/vocab.txt  \n",
      "  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  \n",
      "  inflating: uncased_L-12_H-768_A-12/bert_config.json  \n"
     ]
    }
   ],
   "source": [
    "! mkdir -p ${HOME}/dataset/bert\n",
    "! cd ${HOME}/dataset/bert && \\\n",
    "    curl -O http://kubeflow.oss-cn-beijing.aliyuncs.com/uncased_L-12_H-768_A-12.zip && \\\n",
    "    unzip uncased_L-12_H-768_A-12.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3.创建训练结果的输出目录 ${HOME}/output/bert"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "! mkdir -p ${HOME}/output/bert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4.查看目录结构。\n",
    "* `dataset/bert` 是数据目录，用于存储训练所需的数据。\n",
    "* `models/bert` 是模型代码目录，用于存储模型训练的代码\n",
    "* `output/bert` 是训练结果目录，存放训练结果模型和checkpoint。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/root\r\n",
      "|-- dataset\r\n",
      "|   `-- bert\r\n",
      "|-- models\r\n",
      "|   |-- bert\r\n",
      "|   `-- tensorflow-benchmarks\r\n",
      "`-- output\r\n",
      "    `-- bert\r\n",
      "\r\n",
      "7 directories, 0 files\r\n"
     ]
    }
   ],
   "source": [
    "! tree -I ai-starter -L 2 ${HOME}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5.检查可用GPU资源，训练开始前，我们要保证有足够的空闲GPU资源"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)\r\n",
      "cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0\r\n",
      "cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           1\r\n",
      "cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0\r\n",
      "cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0\r\n",
      "cn-zhangjiakou.i-8vbezxqzueo7662i0dbq  192.168.0.204  master  0           0\r\n",
      "cn-zhangjiakou.i-8vbezxqzueo7681j4fav  192.168.0.206  master  0           0\r\n",
      "-----------------------------------------------------------------------------------------\r\n",
      "Allocated/Total GPUs In Cluster:\r\n",
      "1/3 (33%)  \r\n"
     ]
    }
   ],
   "source": [
    "! arena top node"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.通过Arena提交一个bert 创建pretrainingData 的训练任务, 用于创建Bert pretraining所需要的tfrecord文件。\n",
    "这里`training-data` 是在配置[共享存储时](../docs/setup/SETUP_NAS.md)创建的NAS存储声明.   \n",
    "`--data=training-data:/training` 将其映射到训练任务的`/training`目录。\n",
    "* `/training`目录下的子目录`/training/models/bert` 是步骤1拷贝源代码的位置\n",
    "* `/training`目录下的子目录`/training/dataset/bert` 是步骤2下载数据的位置\n",
    "* `/training`目录下的子目录`/training/output` 就是步骤3创建的训练结果输出的位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: PRETRAIN_DATA_JOB_NAME=bert-create-pretrain-data\n",
      "configmap/bert-create-pretrain-data-tfjob created\n",
      "configmap/bert-create-pretrain-data-tfjob labeled\n",
      "service/bert-create-pretrain-data-tensorboard created\n",
      "deployment.extensions/bert-create-pretrain-data-tensorboard created\n",
      "tfjob.kubeflow.org/bert-create-pretrain-data created\n",
      "\u001b[36mINFO\u001b[0m[0004] The Job bert-create-pretrain-data has been submitted successfully \n",
      "\u001b[36mINFO\u001b[0m[0004] You can run `arena get bert-create-pretrain-data --type tfjob` to check the job status \n"
     ]
    }
   ],
   "source": [
    "%env PRETRAIN_DATA_JOB_NAME=bert-create-pretrain-data\n",
    "!arena submit tf \\\n",
    "             --name=$PRETRAIN_DATA_JOB_NAME \\\n",
    "             --workers=1 \\\n",
    "             --gpus=1 \\\n",
    "             --data=training-data:/training \\\n",
    "             --image=tensorflow/tensorflow:1.11.0-gpu-py3 \\\n",
    "            \"python3 /training/models/bert/create_pretraining_data.py \\\n",
    "            --input_file=/training/models/bert/sample_text.txt \\\n",
    "            --output_file=/training/output/bert/tf_examples.tfrecord \\\n",
    "            --vocab_file=/training/dataset/bert/uncased_L-12_H-768_A-12/vocab.txt \\\n",
    "            --do_lower_case=True \\\n",
    "            --max_seq_length=256 \\\n",
    "            --max_predictions_per_seq=39 \\\n",
    "            --masked_lm_prob=0.15 \\\n",
    "            --random_seed=12345 \\\n",
    "            --dupe_factor=5; python3 -c \\\"import os;fd=os.open('/training/output/bert/tf_examples.tfrecord',os.O_NONBLOCK);os.fsync(fd)\\\"\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7.检查Pretraining data任务的状态，这个步骤不涉及大量计算，任务很快就可以完成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "STATUS: SUCCEEDED\r\n",
      "NAMESPACE: default\r\n",
      "TRAINING DURATION: 3s\r\n",
      "\r\n",
      "NAME                       STATUS     TRAINER  AGE  INSTANCE                           NODE\r\n",
      "bert-create-pretrain-data  SUCCEEDED  TFJOB    48s  bert-create-pretrain-data-chief-0  N/A\r\n",
      "\r\n",
      "Your tensorboard will be available on:\r\n",
      "192.168.0.206:31785   \r\n",
      "\r\n",
      "Events: \r\n",
      "No events for pending pod\r\n"
     ]
    }
   ],
   "source": [
    "! arena get $PRETRAIN_DATA_JOB_NAME -e"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "8.查看创建Pretraining data的任务日志\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2019-03-01T03:19:57.848619386Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0\r\n",
      "2019-03-01T03:19:57.848753719Z INFO:tensorflow:next_sentence_labels: 0\r\n",
      "2019-03-01T03:19:57.849205981Z INFO:tensorflow:*** Example ***\r\n",
      "2019-03-01T03:19:57.849406004Z INFO:tensorflow:tokens: [CLS] like most of [MASK] fellow gold - seekers [MASK] cass was super ##sti [MASK] . [SEP] basket on phil ##am ##mon ' s head , and tr ##otted [MASK] up a neighbouring street . phil ##am ##mon followed , half contempt ##uous , half wondering at [MASK] this [MASK] might be [MASK] which could feed the self - con ##ce ##it of anything so ab ##ject as his ragged little api ##sh guide ; but the novel roar and w ##hir ##l of the street [MASK] the perpetual [MASK] of busy faces [MASK] the line of cu ##rri ##cles , pal ##an ##quin ##s , [MASK] ass ##es [MASK] camel ##s , elephants , [MASK] met and passed him , and squeezed him up steps and into doorway ##s , as they threaded their way through the great dwight - gate into the ample street beyond , drove everything from his mind but wondering curiosity , [MASK] a vague , helpless dread of [MASK] great living wilderness , more terrible than any [MASK] wilderness of sand which he had left behind . [MASK] he longed [MASK] the rep ##ose , the silence of [MASK] laura [MASK] - for faces which [MASK] him and smiled upon him [MASK] but it was too late to turn back now . his [MASK] held on for more than [MASK] mile [MASK] the great main street , crossed in [MASK] centre [MASK] [MASK] city , at right angles , [MASK] one equally magnificent [MASK] at [unused190] end [MASK] [MASK] [SEP]\r\n",
      "2019-03-01T03:19:57.849551229Z INFO:tensorflow:input_ids: 101 2066 2087 1997 103 3507 2751 1011 24071 103 16220 2001 3565 16643 103 1012 102 10810 2006 6316 3286 8202 1005 1055 2132 1010 1998 19817 26174 103 2039 1037 9632 2395 1012 6316 3286 8202 2628 1010 2431 17152 8918 1010 2431 6603 2012 103 2023 103 2453 2022 103 2029 2071 5438 1996 2969 1011 9530 3401 4183 1997 2505 2061 11113 20614 2004 2010 14202 2210 17928 4095 5009 1025 2021 1996 3117 11950 1998 1059 11961 2140 1997 1996 2395 103 1996 18870 103 1997 5697 5344 103 1996 2240 1997 12731 18752 18954 1010 14412 2319 12519 2015 1010 103 4632 2229 103 19130 2015 1010 16825 1010 103 2777 1998 2979 2032 1010 1998 7757 2032 2039 4084 1998 2046 7086 2015 1010 2004 2027 26583 2037 2126 2083 1996 2307 14304 1011 4796 2046 1996 20851 2395 3458 1010 5225 2673 2013 2010 2568 2021 6603 10628 1010 103 1037 13727 1010 13346 14436 1997 103 2307 2542 9917 1010 2062 6659 2084 2151 103 9917 1997 5472 2029 2002 2018 2187 2369 1012 103 2002 23349 103 1996 16360 9232 1010 1996 4223 1997 103 6874 103 1011 2005 5344 2029 103 2032 1998 3281 2588 2032 103 2021 2009 2001 2205 2397 2000 2735 2067 2085 1012 2010 103 2218 2006 2005 2062 2084 103 3542 103 1996 2307 2364 2395 1010 4625 1999 103 2803 103 103 2103 1010 2012 2157 12113 1010 103 2028 8053 12047 103 2012 195 2203 103 103 102\r\n",
      "2019-03-01T03:19:57.849677647Z INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n",
      "2019-03-01T03:19:57.84980452Z INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n",
      "2019-03-01T03:19:57.849810975Z INFO:tensorflow:masked_lm_positions: 4 9 14 29 47 49 52 86 89 93 96 106 109 115 117 139 150 157 164 173 183 186 194 196 201 207 219 225 227 235 237 238 245 247 249 251 253 254 0\r\n",
      "2019-03-01T03:19:57.849951006Z INFO:tensorflow:masked_lm_ids: 2010 1010 20771 2125 2054 4695 1010 1010 5460 1010 1997 14887 1010 2029 1998 4231 2013 1998 2008 2757 2525 2005 1996 1011 2354 1025 5009 1037 2039 1996 1997 1996 2011 8053 1010 2169 1997 2029 0\r\n",
      "2019-03-01T03:19:57.849956674Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0\r\n",
      "2019-03-01T03:19:57.850092816Z INFO:tensorflow:next_sentence_labels: 1\r\n",
      "2019-03-01T03:19:57.85060904Z INFO:tensorflow:*** Example ***\r\n",
      "2019-03-01T03:19:57.850752801Z INFO:tensorflow:tokens: [CLS] but the novel [MASK] and w [MASK] ##l [MASK] the street , the perpetual [MASK] of busy faces [MASK] the line of cu ##rri ##cles , pal ##an ##quin ##s [MASK] [MASK] ass ##es , camel ##s , elephants , [MASK] met and passed him , and squeezed him up steps and into doorway ##s [MASK] as they [MASK] their way through the great moon - gate into the ample street beyond , drove everything from his mind but wondering curiosity [MASK] and a vague , helpless dread of that great living wilderness , more terrible than any dead wilderness of sand which he had left behind . already he longed for the rep ##ose [MASK] the silence of the laura - - for faces which knew him and smiled upon him ; but it [unused946] [MASK] late to turn back now . his guide held on for blown than a mile up [MASK] [MASK] main street , crossed in the [MASK] of [MASK] [MASK] , at right angles , by one equally magnificent , [MASK] each end of which , miles away , appeared , dim [MASK] distant [MASK] the heads of the living stream [MASK] passengers [MASK] [MASK] yellow sand - hills of the desert [MASK] [SEP] looking [MASK] it more at ##ten ##tively , he saw [MASK] it bore the inscription , \" may to cass [MASK] \" like abel of his fellow gold - [MASK] , cass was super ##sti ##tious dotted [SEP]\r\n",
      "2019-03-01T03:19:57.85089608Z INFO:tensorflow:input_ids: 101 2021 1996 3117 103 1998 1059 103 2140 103 1996 2395 1010 1996 18870 103 1997 5697 5344 103 1996 2240 1997 12731 18752 18954 1010 14412 2319 12519 2015 103 103 4632 2229 1010 19130 2015 1010 16825 1010 103 2777 1998 2979 2032 1010 1998 7757 2032 2039 4084 1998 2046 7086 2015 103 2004 2027 103 2037 2126 2083 1996 2307 4231 1011 4796 2046 1996 20851 2395 3458 1010 5225 2673 2013 2010 2568 2021 6603 10628 103 1998 1037 13727 1010 13346 14436 1997 2008 2307 2542 9917 1010 2062 6659 2084 2151 2757 9917 1997 5472 2029 2002 2018 2187 2369 1012 2525 2002 23349 2005 1996 16360 9232 103 1996 4223 1997 1996 6874 1011 1011 2005 5344 2029 2354 2032 1998 3281 2588 2032 1025 2021 2009 951 103 2397 2000 2735 2067 2085 1012 2010 5009 2218 2006 2005 10676 2084 1037 3542 2039 103 103 2364 2395 1010 4625 1999 1996 103 1997 103 103 1010 2012 2157 12113 1010 2011 2028 8053 12047 1010 103 2169 2203 1997 2029 1010 2661 2185 1010 2596 1010 11737 103 6802 103 1996 4641 1997 1996 2542 5460 103 5467 103 103 3756 5472 1011 4564 1997 1996 5532 103 102 2559 103 2009 2062 2012 6528 25499 1010 2002 2387 103 2009 8501 1996 9315 1010 1000 2089 2000 16220 103 1000 2066 16768 1997 2010 3507 2751 1011 103 1010 16220 2001 3565 16643 20771 20384 102 0 0 0 0 0 0 0 0\r\n",
      "2019-03-01T03:19:57.85102435Z INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0\r\n",
      "2019-03-01T03:19:57.851149611Z INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0\r\n",
      "2019-03-01T03:19:57.851160529Z INFO:tensorflow:masked_lm_positions: 4 7 9 15 19 31 32 41 56 59 82 91 116 118 133 135 136 137 149 154 155 162 164 165 176 188 190 197 199 200 208 211 220 230 233 239 246 0 0\r\n",
      "2019-03-01T03:19:57.85132258Z INFO:tensorflow:masked_lm_ids: 11950 11961 1997 5460 1010 1010 14887 2029 1010 26583 1010 2307 1010 4223 1025 2009 2001 2205 2062 1996 2307 2803 1996 2103 2012 1998 2058 1997 1010 1996 1025 2012 2008 1012 2087 24071 1012 0 0\r\n",
      "2019-03-01T03:19:57.851330847Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0\r\n",
      "2019-03-01T03:19:57.851464762Z INFO:tensorflow:next_sentence_labels: 1\r\n",
      "2019-03-01T03:19:57.851968383Z INFO:tensorflow:*** Example ***\r\n",
      "2019-03-01T03:19:57.852112074Z INFO:tensorflow:tokens: [CLS] ? [MASK] and berlin little man jumped [MASK] , put his basket on phil ##am ##mon ' s head , and tr ##otted off up a neighbouring [MASK] . [MASK] ##am ##mon followed , half contempt ##uous , half [MASK] at what this philosophy might [MASK] , which [MASK] [MASK] the [MASK] - con ##ce ##it of anything so ab ##ject [MASK] his ragged little api ##sh guide ; but the novel roar [MASK] w ##hir ##l propose the street , the perpetual stream of busy faces , the line of cu ##rri ##cles , pal ##an ##quin ##s [MASK] laden ass ##es , camel ##s , elephants , which [MASK] and passed him , and bahn him up steps and into doorway ##s [MASK] as they threaded [MASK] [MASK] through the great moon - [MASK] into the ample street beyond , drove everything [MASK] his mind but wondering curiosity [MASK] and a vague , helpless dread of that great [MASK] wilderness , [MASK] terrible than any [MASK] wilderness of sand which he had left behind [MASK] already he longed for the [MASK] ##ose , the [MASK] of [SEP] [MASK] guide held on [MASK] more than a mile up the great main street , crossed in the centre of [MASK] city [MASK] at right angles , by [MASK] equally magnificent , at each end of which , miles fleets , appeared , dim and distant over the heads of the living stream of passengers , the yellow sand - hills of the desert ; [SEP]\r\n",
      "2019-03-01T03:19:57.852264474Z INFO:tensorflow:input_ids: 101 1029 103 1998 4068 2210 2158 5598 103 1010 2404 2010 10810 2006 6316 3286 8202 1005 1055 2132 1010 1998 19817 26174 2125 2039 1037 9632 103 1012 103 3286 8202 2628 1010 2431 17152 8918 1010 2431 103 2012 2054 2023 4695 2453 103 1010 2029 103 103 1996 103 1011 9530 3401 4183 1997 2505 2061 11113 20614 103 2010 14202 2210 17928 4095 5009 1025 2021 1996 3117 11950 103 1059 11961 2140 16599 1996 2395 1010 1996 18870 5460 1997 5697 5344 1010 1996 2240 1997 12731 18752 18954 1010 14412 2319 12519 2015 103 14887 4632 2229 1010 19130 2015 1010 16825 1010 2029 103 1998 2979 2032 1010 1998 17392 2032 2039 4084 1998 2046 7086 2015 103 2004 2027 26583 103 103 2083 1996 2307 4231 1011 103 2046 1996 20851 2395 3458 1010 5225 2673 103 2010 2568 2021 6603 10628 103 1998 1037 13727 1010 13346 14436 1997 2008 2307 103 9917 1010 103 6659 2084 2151 103 9917 1997 5472 2029 2002 2018 2187 2369 103 2525 2002 23349 2005 1996 103 9232 1010 1996 103 1997 102 103 5009 2218 2006 103 2062 2084 1037 3542 2039 1996 2307 2364 2395 1010 4625 1999 1996 2803 1997 103 2103 103 2012 2157 12113 1010 2011 103 8053 12047 1010 2012 2169 2203 1997 2029 1010 2661 25515 1010 2596 1010 11737 1998 6802 2058 1996 4641 1997 1996 2542 5460 1997 5467 1010 1996 3756 5472 1011 4564 1997 1996 5532 1025 102\r\n",
      "2019-03-01T03:19:57.852419325Z INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n",
      "2019-03-01T03:19:57.85254519Z INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\r\n",
      "2019-03-01T03:19:57.852551662Z INFO:tensorflow:masked_lm_positions: 2 4 7 8 28 30 40 46 49 50 52 56 62 74 78 100 111 117 125 127 129 130 136 145 151 161 164 168 174 177 183 187 190 194 210 212 218 229 0\r\n",
      "2019-03-01T03:19:57.852693547Z INFO:tensorflow:masked_lm_ids: 1005 1996 5598 2039 2395 6316 6603 2022 2071 5438 2969 4183 2004 1998 1997 1010 2777 7757 1010 2027 2037 2126 4796 2013 1010 2542 2062 2757 2018 1012 16360 4223 2010 2005 1996 1010 2028 2185 0\r\n",
      "2019-03-01T03:19:57.852699495Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0\r\n",
      "2019-03-01T03:19:57.852834135Z INFO:tensorflow:next_sentence_labels: 0\r\n",
      "2019-03-01T03:19:57.868337702Z INFO:tensorflow:Wrote 44 total instances\r\n"
     ]
    }
   ],
   "source": [
    "! arena logs --tail=30 $PRETRAIN_DATA_JOB_NAME\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "9.创建bert pretrain 的训练任务， "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: PRETRAIN_JOB_NAME=bert-pretrain-data\n",
      "configmap/bert-pretrain-data-tfjob created\n",
      "configmap/bert-pretrain-data-tfjob labeled\n",
      "tfjob.kubeflow.org/bert-pretrain-data created\n",
      "\u001b[36mINFO\u001b[0m[0003] The Job bert-pretrain-data has been submitted successfully \n",
      "\u001b[36mINFO\u001b[0m[0003] You can run `arena get bert-pretrain-data --type tfjob` to check the job status \n"
     ]
    }
   ],
   "source": [
    "%env PRETRAIN_JOB_NAME=bert-pretrain-data\n",
    "! arena submit tf --name=$PRETRAIN_JOB_NAME \\\n",
    "                --gpus=1 \\\n",
    "                --workers=1 \\\n",
    "                --data=training-data:/training \\\n",
    "                --image=tensorflow/tensorflow:1.11.0-gpu-py3 \\\n",
    "                \"python /training/models/bert/run_pretraining.py \\\n",
    "                --input_file=/training/output/bert/tf_examples.tfrecord \\\n",
    "                --output_dir=/training/output/bert/pretraining_output \\\n",
    "                --do_train=True \\\n",
    "                --do_eval=True \\\n",
    "                --bert_config_file=/training/dataset/bert/uncased_L-12_H-768_A-12/bert_config.json \\\n",
    "                --train_batch_size=16 \\\n",
    "                --max_seq_length=256 \\\n",
    "                --max_predictions_per_seq=39 \\\n",
    "                --num_train_steps=8000 \\\n",
    "                --num_warmup_steps=10 \\\n",
    "                --learning_rate=2e-5 \\\n",
    "                --save_checkpoints_steps=4000\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "10.查看实时训练的GPU使用情况"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INSTANCE NAME               GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)           STATUS   NODE\r\n",
      "bert-pretrain-data-chief-0  0                  100%             15519.0MiB / 16276.2MiB   Running  192.168.0.210\r\n"
     ]
    }
   ],
   "source": [
    "! arena top job $PRETRAIN_JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "11.查看pretraining 的任务状态和实例情况，本示例中，我们启动了一个训练实例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "STATUS: RUNNING\r\n",
      "NAMESPACE: default\r\n",
      "TRAINING DURATION: 3m\r\n",
      "\r\n",
      "NAME                STATUS   TRAINER  AGE  INSTANCE                    NODE\r\n",
      "bert-pretrain-data  RUNNING  TFJOB    3m   bert-pretrain-data-chief-0  192.168.0.210\r\n"
     ]
    }
   ],
   "source": [
    "! arena get $PRETRAIN_JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "12.查看pretraining 的训练日志，一段时间后，可以出现`tensorflow:examples/sec`相关的日志，代表已经开始训练，以及训练的速度。\n",
    "出现由于bert pretraining 时间非常长，如果我们想要实时查看日志，可以增加`-f`参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2019-03-01T03:24:31.335467491Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/kernel:0, shape = (768, 768)\n",
      "2019-03-01T03:24:31.335472861Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.335629243Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/kernel:0, shape = (768, 768)\n",
      "2019-03-01T03:24:31.335635168Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.335790291Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/kernel:0, shape = (768, 768)\n",
      "2019-03-01T03:24:31.335795756Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.335952831Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768)\n",
      "2019-03-01T03:24:31.335958202Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.33609783Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, shape = (768,)\n",
      "2019-03-01T03:24:31.33610331Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, shape = (768,)\n",
      "2019-03-01T03:24:31.336208538Z INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072)\n",
      "2019-03-01T03:24:31.336345617Z INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,)\n",
      "2019-03-01T03:24:31.336351398Z INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768)\n",
      "2019-03-01T03:24:31.336498005Z INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.336503414Z INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,)\n",
      "2019-03-01T03:24:31.336645136Z INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,)\n",
      "2019-03-01T03:24:31.336650519Z INFO:tensorflow:  name = bert/pooler/dense/kernel:0, shape = (768, 768)\n",
      "2019-03-01T03:24:31.33680328Z INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.336808531Z INFO:tensorflow:  name = cls/predictions/transform/dense/kernel:0, shape = (768, 768)\n",
      "2019-03-01T03:24:31.336957828Z INFO:tensorflow:  name = cls/predictions/transform/dense/bias:0, shape = (768,)\n",
      "2019-03-01T03:24:31.336965945Z INFO:tensorflow:  name = cls/predictions/transform/LayerNorm/beta:0, shape = (768,)\n",
      "2019-03-01T03:24:31.337111549Z INFO:tensorflow:  name = cls/predictions/transform/LayerNorm/gamma:0, shape = (768,)\n",
      "2019-03-01T03:24:31.337331045Z INFO:tensorflow:  name = cls/predictions/output_bias:0, shape = (30522,)\n",
      "2019-03-01T03:24:31.337348181Z INFO:tensorflow:  name = cls/seq_relationship/output_weights:0, shape = (2, 768)\n",
      "2019-03-01T03:24:31.337352586Z INFO:tensorflow:  name = cls/seq_relationship/output_bias:0, shape = (2,)\n",
      "2019-03-01T03:24:40.280462098Z INFO:tensorflow:Done calling model_fn.\n",
      "2019-03-01T03:24:40.282210657Z INFO:tensorflow:Create CheckpointSaverHook.\n",
      "2019-03-01T03:24:44.183074716Z INFO:tensorflow:Graph was finalized.\n",
      "2019-03-01T03:24:44.183316826Z 2019-03-01 03:24:44.183119: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n",
      "2019-03-01T03:24:44.351791898Z 2019-03-01 03:24:44.351531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
      "2019-03-01T03:24:44.352922705Z 2019-03-01 03:24:44.352755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: \n",
      "2019-03-01T03:24:44.352939062Z name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\n",
      "2019-03-01T03:24:44.352942657Z pciBusID: 0000:00:09.0\n",
      "2019-03-01T03:24:44.352945648Z totalMemory: 15.89GiB freeMemory: 15.60GiB\n",
      "2019-03-01T03:24:44.352948528Z 2019-03-01 03:24:44.352791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0\n",
      "2019-03-01T03:24:44.679981448Z 2019-03-01 03:24:44.679730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:\n",
      "2019-03-01T03:24:44.680018686Z 2019-03-01 03:24:44.679793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 \n",
      "2019-03-01T03:24:44.680022093Z 2019-03-01 03:24:44.679801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N \n",
      "2019-03-01T03:24:44.680381343Z 2019-03-01 03:24:44.680250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15117 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 6.0)\n",
      "2019-03-01T03:24:49.827980455Z INFO:tensorflow:Running local_init_op.\n",
      "2019-03-01T03:24:49.951970475Z INFO:tensorflow:Done running local_init_op.\n",
      "2019-03-01T03:24:57.201907836Z INFO:tensorflow:Saving checkpoints for 0 into /training/output/bert/pretraining_output/model.ckpt.\n",
      "2019-03-01T03:26:08.446110241Z INFO:tensorflow:global_step/sec: 1.72217\n",
      "2019-03-01T03:26:08.44614756Z INFO:tensorflow:examples/sec: 27.5547\n",
      "2019-03-01T03:27:01.490555755Z INFO:tensorflow:global_step/sec: 1.8852\n",
      "2019-03-01T03:27:01.490728971Z INFO:tensorflow:examples/sec: 30.1632\n",
      "2019-03-01T03:27:54.542023408Z INFO:tensorflow:global_step/sec: 1.88496\n",
      "2019-03-01T03:27:54.542154644Z INFO:tensorflow:examples/sec: 30.1594\n",
      "2019-03-01T03:28:47.586233208Z INFO:tensorflow:global_step/sec: 1.88522\n",
      "2019-03-01T03:28:47.586289848Z INFO:tensorflow:examples/sec: 30.1635\n"
     ]
    }
   ],
   "source": [
    "! arena logs --tail=50 $PRETRAIN_JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "13.训练完成后，我们可以删除已经完成的任务，清理环境。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "service \"tf-distributed-mnist-tensorboard\" deleted\n",
      "deployment.extensions \"tf-distributed-mnist-tensorboard\" deleted\n",
      "tfjob.kubeflow.org \"tf-distributed-mnist\" deleted\n",
      "configmap \"tf-distributed-mnist-tfjob\" deleted\n",
      "\u001b[36mINFO\u001b[0m[0004] The Job tf-distributed-mnist has been deleted successfully \n"
     ]
    }
   ],
   "source": [
    "! arena delete $PRETRAIN_JOB_NAME\n",
    "! arena delete $PRETRAIN_DATA_JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "恭喜！您已经使用 `arena` 成功运行了训练作业\n",
    "\n",
    "总结，希望您通过本次演示如何提交Bert的pretraining任务，将包含以下几个步骤：\n",
    "1. 准备训练代码和数据，并将其放入数据卷中\n",
    "2. 提交pretraining 所需的数据处理工作\n",
    "3. 提交pretraining 的训练任务"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
